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Introduction 

In multiplierless hardware implementations of DSP transforms, multiplications by constants are
implemented as networks of (wire) shifts and additions. The number of additions required can be
reduced by approximating the multiplicative constants with lower-precision fixed-point
representations, but the loss of precision increases the numerical error of the implementation.
This trade-off can be leveraged to reduce hardware area, critical path, and power/energy while
maintaining the perceptible quality of a signal processing application (e.g., MPEG-4). This paper
describes an automatic approach to minimize the number of additions subject to a given quality
measure or, vice versa, to maximize the quality subject to a given number of available additions.
Our approach handles linear DSP transforms in general and includes optimizations over the space of
algorithm design. A Verilog backend generates synthesizable descriptions of the final
variable-width fixed-point implementations.

Approach 

We consider the following two optimization problems for a given linear DSP transform: (1) given a
quality threshold Q, find the multiplierless implementation with the least arithmetic cost C that
satisfies Q; (2) given an arithmetic cost threshold C, find the multiplierless implementation with
the highest quality Q. Our proposed system solves these problems automatically in the steps
described next. We consider problem (1); problem (2) is analogous.

Given  is  a  formally  specified  linear  DSP  transform  T  (e.g.,  a  DCT  of  size  8)  and  the  quality  threshold  Q 
(e.g.,  the  maximum  allowed  error  of  the  output). 

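Step 1: Generating a Fast Algorithm. First, we generate a fast algorithm for T, represented as a
formula in a mathematical notation, using SPIRAL (Moura et al., cited in the footnotes). The
formula is built from a few constructs and primitives such as the Kronecker product $\otimes$,
permutation matrices, or 2 x 2 rotations $R_\alpha$. For example, one of many possible formulas
for the DCT of size 8 is

$$
\begin{aligned}
\mathrm{DCT}_8 ={}& [(2,5)(4,7)(6,8),8] \cdot \left(\operatorname{diag}(1, 1/\sqrt{2}) \oplus R_{3\pi/8} \oplus R_{15\pi/16} \oplus R_{21\pi/16}\right) \\
&\cdot [(2,4,7,3,8),8] \cdot \left((F_2 \otimes I_3) \oplus I_2\right) \cdot \left(I_4 \oplus \tfrac{1}{\sqrt{2}} F_2 \oplus I_2\right) \cdot [(2,3,4,5,8,6,7),8] \\
&\cdot \left(I_2 \otimes \left((F_2 \otimes I_2) \cdot [(2,3),4] \cdot (I_2 \otimes F_2)\right)\right) \cdot [(1,8,6,2)(3,4,5,7),8].
\end{aligned}
$$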

Step 2: Manipulation for Numerical Stability. In the second step, we formally manipulate the
formula to increase its numerical stability, which determines how quickly the quality of T degrades
when implemented in low precision. In particular, we expand the formula into lifting steps using
ideas from Liang and Tran (cited in the footnotes).

* The work of Markus Püschel was supported by DARPA through research grant DABT63-98-1-0004
administered by the Army Directorate of Contracting and by NSF through award 9988296.

[1] J. Moura, J. Johnson, R. Johnson, D. Padua, V. Prasanna, M. Püschel, B. Singer, M. Veloso, and
J. Xiong. Generating Platform-Adapted DSP Libraries using SPIRAL. In Proc. HPEC, 2001.
http://www.ece.cmu.edu/~spiral.

[2] J. Liang and T.D. Tran. Fast Multiplierless Approximations of the DCT With the Lifting Scheme.
IEEE Transactions on Signal Processing, Vol. 49, No. 12, Dec. 2001, pages 3032-3044.




Report Documentation Page (Standard Form 298, Rev. 8-98; Form Approved, OMB No. 0704-0188)

1. Report date: 20 AUG 2004
2. Report type: N/A
4. Title and subtitle: Custom Reduction of Arithmetic in Linear DSP Transforms
7. Performing organization: Electrical and Computer Engineering, Carnegie Mellon University,
   Pittsburgh, PA
12. Distribution/availability statement: Approved for public release, distribution unlimited
13. Supplementary notes: See also ADM001694, HPEC-6-Vol 1 ESC-TR-2003-081; High Performance
    Embedded Computing (HPEC) Workshop (7th). The original document contains color images.
16. Security classification: report unclassified; abstract unclassified; this page unclassified
17. Limitation of abstract: UU
18. Number of pages: 30




Figure 1: Evolutionary optimization. Left: for DCT_8, minimizing the number of additions for
various given coding gains (cg); right: for DFT_16, optimizing the convolution error for various
given numbers of additions.


Step 3: Constant Reduction and Search. In this step the actual constrained optimization is
performed using an automated search. The idea is to replace each constant (multiplication)
occurring in the formula by a low-precision version, specified by a number of bits in
$\{0, 1, \ldots, b_{\max}\}$. Doing so for every constant yields an approximation $\tilde{T}$ of
the original transform T; $\tilde{T}$ has a lower cost C than the original, i.e., it requires fewer
additions. If the formula for T contains n multiplier constants $a_1, \ldots, a_n$, there are
$(b_{\max} + 1)^n$ possible approximations, which determines the search space for our
optimization. Since an exhaustive search is infeasible, we use evolutionary and greedy search
techniques to find the approximation with the lowest cost (least number of additions) that still
satisfies the quality threshold Q.

Step 4: Mapping to Verilog. In this final step we map the resulting (approximated) formula into
Verilog.

We note that in the approach above, the formula, i.e., the algorithm chosen for the transform, was
fixed. The optimization can readily be extended to include the space of different possible
formulas by using SPIRAL's formula generator.

Experimental  Results 

We  show  two  examples  for  two  different  optimization  problems  for  the  discrete  cosine  transform  (DCT)  and 
discrete  Fourier  transform  (DFT). 

DCT, size 8. We chose as quality measure the coding gain (cg) in dB, which for the exact
(infinite-precision) DCT is about 8.8259. We considered one formula for the DCT generated by SPIRAL
(similar to the one above). A 10-bit multiplierless implementation of this formula requires 56
additions. After formula manipulation, we considered 9 constants in the formula for further
approximation, which yields a search space of size $(b_{\max} + 1)^9$. Figure 1 (left) shows the
results of an evolutionary search for various cg thresholds. The abscissa shows the generations in
this search, the ordinate the cost of the best solution found. For example, after 100 generations,
for cg = 8.81 a solution with only 31 adders was found. The search took 30 minutes.

DFT,  size  16.  We  chose  as  quality  measure  the  convolution  error  (ce),  which  determines  to  what  extent 
the  DFT’s  convolution  property  is  violated.  The  exact  DFT  has  ce  =  0.  Again  we  considered  one  particular 
formula,  whose  10-bit  implementation  requires  256  adders.  Figure  1  (right)  shows  the  results  for  fixing  the 
number  of  additions  and  optimizing  the  achievable  quality.  For  example,  by  allowing  170  adders,  a  solution 
with  ce  =  0.341  was  found  after  150  generations.  The  search  took  2  hours. 

Dept. of Electrical and Computer Engineering
Carnegie  Mellon  University 


Research  Overview 


♦  Linear  DSP  transforms 

-  e.g.  DFT,  DCTs,  WHT,  DWTs,  .... 

-  ubiquitously  used,  often  in  computation  intensive  kernels 

-  comprised  of  additions  and  multiplication-by-constant 

-  applications: multimedia, biometrics, image/data processing, ...

♦  Light-weight hardware implementations

-  fixed-point data format

-  multiplierless: multiplication by constant as shifts and adds

-  problem 1: output quality is reduced by cost-saving measures
   (reducing the bitwidth of data and constants)

-  problem 2: different applications have vastly different quality
   metrics and requirements

=> need application-specific tuning

Our  Goal:  automatic,  custom  reduction  of  arithmetic 
(additions)  w.r.t.  a  given  application’s  requirements 
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Our  Automatic  Flow 


Flow (top to bottom):

1. Input: DSP transform
2. Algorithm selection (robust, structure): rotation-based algorithm; output: algorithm (as formula)
3. Algorithm manipulation (robustness): expansion into lifting steps; output: algorithm (as formula)
4. Search: constant reduction, i.e., search for the cheapest constant reduction satisfying Q;
   additional input: quality constraint
5. Output: custom low-cost algorithm

Example: DCT, size 32, in MPEG decoder; quality constraint: MPEG compliance test; output: custom
low-cost algorithm
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Related  Work 


♦  Liang/Tran,  “Fast  Multiplierless  Approximation  of  the  DCT  with  the  Lifting 
Scheme,”  IEEE  Trans.  Sig.  Proc.,  49(12)  2001,  pp.  3032-3044 

examined  arithmetic  cost  reduction  for  DCT  size  8 
steps  performed  by  hand,  exhaustive  search 

♦  Fang/Rutenbar/Puschel/Chen,  “Toward  Efficient  Static  Analysis  of  Finite- 
Precision  Effects  in  DSP  Applications  via  Affine  Arithmetic  Modeling,”  Proc. 
DAC  2003 

efficient  static  analysis  of  output  error  (hard  and  probabilistic) 
range  of  input  values  used/needed 
analysis  assumes  a  common  global  bitwidth 

♦  Puschel/Singer/Voronenko/Xiong/Moura/Johnson/Veloso/Johnson,  “SPIRAL 
system”,  www.spiral.net 

automatic  generation  of  custom  runtime  optimized  DSP  transform  software 
provides  implementation  environment  for  our  approach  (in  particular  algorithm 
generation  and  manipulation) 
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Outline 


♦  DSP  transform  algorithms 

♦  Algorithm  manipulation  for  robustness 

♦  Multiplication  by  constants 

♦  Search  Methods 

♦  Results 
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DSP  Algorithms  as  Formulas: 
Example: DFT size 4


Cooley/Tukey  FFT  (size  4): 
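$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2) \cdot \operatorname{diag}(1, 1, 1, i) \cdot (I_2 \otimes \mathrm{DFT}_2) \cdot [(2,3),4]$$

-  $\mathrm{DFT}_2$: Fourier transform
-  $\operatorname{diag}(1, 1, 1, i)$: diagonal matrix (twiddles)
-  $\otimes$: Kronecker product
-  $I_2$: identity
-  $[(2,3),4]$: permutation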

allows  for  computer  generation/manipulation 

(provided  by  SPIRAL) 
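
The size-4 factorization can be checked numerically. The following is a minimal sketch, assuming NumPy and the root-of-unity convention $\omega_4 = i$ that the twiddle matrix diag(1, 1, 1, i) implies; the helper name `dft_matrix` is ours and is not part of SPIRAL.

```python
import numpy as np

def dft_matrix(n, root):
    """n x n DFT matrix with entries root**(k*l); `root` is a primitive n-th root of unity."""
    k = np.arange(n)
    return root ** np.outer(k, k)

F2 = dft_matrix(2, -1)            # butterfly [[1, 1], [1, -1]]
I2 = np.eye(2)
T = np.diag([1, 1, 1, 1j])        # twiddle factors
P = np.eye(4)[[0, 2, 1, 3]]       # permutation [(2,3),4]: swaps entries 2 and 3

# Right-most factor acts first: permute, two butterflies, twiddles, two butterflies.
factored = np.kron(F2, I2) @ T @ np.kron(I2, F2) @ P
direct = dft_matrix(4, 1j)

print(np.allclose(factored, direct))
```

Because every factor is a sparse, structured matrix, the product form is what a generator can manipulate symbolically; the dense matrix `direct` is only used here as a reference.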
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Example:  DCT  size  8 


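$$
\begin{aligned}
\mathrm{DCT}_8 ={}& [(2,5)(4,7)(6,8),8] \\
&\cdot \left(\operatorname{diag}(1, 1/\sqrt{2}) \oplus R_{3\pi/8} \oplus R_{15\pi/16} \oplus R_{21\pi/16}\right) \\
&\cdot [(2,4,7,3,8),8] \cdot \left((\mathrm{DFT}_2 \otimes I_3) \oplus I_2\right) \cdot [(5,6),8] \\
&\cdot \left(I_4 \oplus \tfrac{1}{\sqrt{2}} \cdot \mathrm{DFT}_2 \oplus I_2\right) \cdot [(2,3,4,5,8,6,7),8] \\
&\cdot \left(I_2 \otimes \left((\mathrm{DFT}_2 \otimes I_2) \cdot [(2,3),4] \cdot (I_2 \otimes \mathrm{DFT}_2)\right)\right) \\
&\cdot [(1,8,6,2)(3,4,5,7),8]
\end{aligned}
$$

as formula
(generated by SPIRAL)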


Basic  building  blocks: 

-  2  X  2  rotations,  DFT_2's  (butterflies),  permutations,  diagonal  matrices  (scaling) 

Algorithm is orthogonal, hence robust to input errors (from fixed-point representation)

Fixed Point Error: Data vs. Transform


Implementing a transform $x \mapsto Tx$ in fixed-point arithmetic
produces two types of errors:

♦  Error in input x: $\|x - \tilde{x}\|$

-  from rounding of the input coefficients x to the fixed-point data
   representation $\tilde{x}$

-  for robustness: choose orthogonal algorithms

♦  Error in transform: $\|T - \tilde{T}\|$

-  from finite-precision multiplication by constants

-  further approximation is a source of savings in
   multiplierless implementations

-  for robustness: translate algorithm into lifting steps
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Lifting  Steps 


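♦  Lifting step (LS):

$$\begin{bmatrix} 1 & x \\ 0 & 1 \end{bmatrix} \quad \text{or} \quad \begin{bmatrix} 1 & 0 \\ y & 1 \end{bmatrix}$$

-  invertible (det = 1), independent of the approximation of x, y

-  inverse of LS is also LS (with -x, -y)

-  if LS is cheap, then so is its inverse

♦  Rotation as lifting steps

$$
R_\alpha = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix}
= \begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}
  \begin{bmatrix} 1 & 0 \\ u & 1 \end{bmatrix}
  \begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix},
\qquad p = \frac{1 - \cos\alpha}{\sin\alpha} = \tan\frac{\alpha}{2}, \quad u = -\sin\alpha
$$

rotation-based algorithms can be
automatically expanded into LS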
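
The two properties above (exact factorization of a rotation, and invertibility regardless of how coarsely p and u are rounded) can be checked with a short sketch. It assumes NumPy; the helper names `upper`, `lower`, `rotation`, and `quantize`, the angle, and the 4-bit rounding are illustrative choices of ours, not taken from the slides.

```python
import math
import numpy as np

def upper(x):
    """Lifting step [[1, x], [0, 1]]."""
    return np.array([[1.0, x], [0.0, 1.0]])

def lower(y):
    """Lifting step [[1, 0], [y, 1]]."""
    return np.array([[1.0, 0.0], [y, 1.0]])

def rotation(alpha):
    c, s = math.cos(alpha), math.sin(alpha)
    return np.array([[c, s], [-s, c]])

def quantize(c, bits):
    """Round c to `bits` fractional bits (illustrative rounding rule)."""
    return round(c * 2**bits) / 2**bits

alpha = 3 * math.pi / 8
p, u = math.tan(alpha / 2), -math.sin(alpha)

# Exact constants reproduce the rotation.
exact = upper(p) @ lower(u) @ upper(p)

# Coarsely rounded constants: no longer a rotation, but still exactly invertible,
# and the inverse is again three lifting steps with negated constants.
pq, uq = quantize(p, 4), quantize(u, 4)
approx = upper(pq) @ lower(uq) @ upper(pq)
approx_inv = upper(-pq) @ lower(-uq) @ upper(-pq)

print(np.allclose(exact, rotation(alpha)))
print(np.allclose(approx @ approx_inv, np.eye(2)))
```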
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Error  Analysis 


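♦  rounding error $\varepsilon$ in the first lifting step (third LS analogous)

$$
\begin{bmatrix} 1 & p + \varepsilon \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ u & 1 \end{bmatrix}
\begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}
= R_\alpha + \begin{bmatrix} -\varepsilon \sin\alpha & \varepsilon \cos\alpha \\ 0 & 0 \end{bmatrix}
$$

=> $\varepsilon$ is not magnified

♦  rounding error $\varepsilon$ in the second lifting step

$$
\begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ u + \varepsilon & 1 \end{bmatrix}
\begin{bmatrix} 1 & p \\ 0 & 1 \end{bmatrix}
= R_\alpha + \begin{bmatrix} \varepsilon \tan\frac{\alpha}{2} & \varepsilon \tan^2\frac{\alpha}{2} \\ \varepsilon & \varepsilon \tan\frac{\alpha}{2} \end{bmatrix}
$$

=> $\varepsilon$ is magnified, unless $\alpha \in [0, \pi/2]$ or $[3\pi/2, 2\pi]$

Solution: angle manipulation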
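
The magnification claim can be illustrated numerically. The sketch below perturbs one lifting constant at a time and reports the largest entry of the resulting error matrix divided by the perturbation. Measuring the error by its largest absolute entry is our assumption (the slides do not name a norm), and the function names and test angles are illustrative.

```python
import math
import numpy as np

def lifted_rotation(alpha, eps_first=0.0, eps_second=0.0):
    """Rotation R_alpha as three lifting steps, with optional perturbed constants."""
    p, u = math.tan(alpha / 2), -math.sin(alpha)
    first = np.array([[1.0, p + eps_first], [0.0, 1.0]])
    second = np.array([[1.0, 0.0], [u + eps_second, 1.0]])
    third = np.array([[1.0, p], [0.0, 1.0]])
    return first @ second @ third

def error_gain(alpha, which, eps=1e-3):
    """Largest entry of the error matrix, relative to the perturbation eps."""
    exact = lifted_rotation(alpha)
    if which == "first":
        perturbed = lifted_rotation(alpha, eps_first=eps)
    else:
        perturbed = lifted_rotation(alpha, eps_second=eps)
    return np.max(np.abs(perturbed - exact)) / eps

for alpha in (math.pi / 4, 3 * math.pi / 4):
    print(alpha, error_gain(alpha, "first"), error_gain(alpha, "second"))
```

For an angle inside [0, π/2] the gain of the second step stays at 1, while for 3π/4 it grows to tan²(3π/8), which is the situation that angle manipulation is meant to avoid.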
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Ensuring  Robustness 


Steps  to  ensure  robustness 

♦  Choose  algorithms  based  on  rotations 

♦  Manipulate  angles  of  rotations 

♦  Expand  into  lifting  steps 

Done  automatically  as  formula  manipulation 
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Multiplication  by  Constants 


Operations in transforms:

    y = x_1 + x_2     addition
    y = c x           multiplication by constant

Example (a barred digit $\bar{1}$ has the value -1):

    simple           c = 0.10111011                     5 adds (5 shifts)
    SD recoding 1    c = $0.1100\bar{1}10\bar{1}$       4 adds (3 shifts)
    SD recoding 2    c = $0.11000\bar{1}0\bar{1}$       3 adds (3 shifts)

SD recoding is not optimal
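
The adder counts in this example can be reproduced with a short sketch. It uses the non-adjacent form (NAF), one particular signed-digit recoding with a minimal number of nonzero digits; treating it as the recoding behind "SD recoding 2" is our assumption, and the function names are illustrative. A constant with w nonzero digits costs w - 1 additions or subtractions.

```python
def naf_digits(n):
    """Non-adjacent form of a non-negative integer, least significant digit first."""
    digits = []
    while n > 0:
        if n & 1:
            d = 2 - (n % 4)   # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
            n -= d
        else:
            d = 0
        digits.append(d)
        n //= 2
    return digits

def adds_binary(n):
    """Adds needed when every 1 bit contributes one shifted copy of the input."""
    return max(bin(n).count("1") - 1, 0)

def adds_naf(n):
    """Adds/subtracts needed after signed-digit (NAF) recoding."""
    return max(sum(1 for d in naf_digits(n) if d != 0) - 1, 0)

c = 0b10111011   # the constant 0.10111011, scaled by 2**8
print(adds_binary(c), adds_naf(c))
```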
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Addition/Subtraction  Chain 


♦  Provide the optimal solution for
   constant multiplication using adds and shifts

♦  Finding the optimal addition chain
   is a hard problem

♦  A near-optimal table of solutions
   can be computed using dynamic
   programming methods*

♦  For all constants up to 2^19

-  only 225 constants require
   more than 5 additions
   (214 need 6, 11 need 7)

Example: c = 0.10111011 with 3 adds (3 shifts)

    0.00000101 = 0.00000100 + 0.00000001
    0.00001011 = 2 x 0.00000101 + 0.00000001
    0.10111011 = 0.10110000 + 0.00001011,  where 0.10110000 = 2^4 x 0.00001011

*Sebastian  Egner,  Philips  Research,  Eindhoven 
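
The three-addition chain for c = 0.10111011 can be written out directly. This is a minimal sketch on integers (the constant is scaled by 2^8, so a fixed-point design would drop the lowest 8 bits of the result); the function name is illustrative.

```python
def times_187(x):
    """Multiply by 187 = 0b10111011 using 3 additions and 3 shifts."""
    t5 = (x << 2) + x          # 5x   = 4x + x          (addition 1)
    t11 = (t5 << 1) + x        # 11x  = 2 * 5x + x      (addition 2)
    return (t11 << 4) + t11    # 187x = 16 * 11x + 11x  (addition 3)

print(times_187(3), 187 * 3)
```

The saving over signed-digit recoding comes from reusing the intermediate product 11x twice, which a digit-by-digit recoding cannot express.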

[Figure: histogram of addition cost for all constants between 1 and 2^19; series: SD recoding, addition chains]


Misra,  Zelinski,  Hoe,  Puschel,  CMU/ECE 


HPEC  2003,  Slide  16 




Optimization  Problem 


Given  a  linear  DSP  transform  and  quality  measure  Q 

1 .  Find  the  multiplierless  implementation  with  the  least 
arithmetic  cost  C  (number  of  additions)  that  satisfies  a 
given  Q  threshold 

2.  Find  the  multiplierless  implementation  with  the  highest 
quality  Q  for  a  given  arithmetic  cost  C  threshold 
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Quality  Measures  of  Transforms 


For an approximation $\tilde{T}$ of a transform T:

♦  Transform-independent Q

-  $\|T - \tilde{T}\|$ for some norm $\|\cdot\|$

♦  Transform  dependent  Q 

-  coding  gain  for  DCT 

-  convolution  error  for  DFT 

♦  Application-based  Q 

-  MPEG  standard  compliance  test 
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Search  Space:  approximating 

multiplicative  constants 


♦  For each multiplication by a constant in the transform,
   choose a custom bitwidth $i \in [0 \ldots k-1]$

-  Given n constants, $k^n$ configurations are possible

♦  But, for a given constant, not all k configurations lead
   to different costs,

e.g., given the 5-bit constant 0.11101, SD recoding gives

    5-bit = .11101 = $1.00\bar{1}01$   => 2 adds
    4-bit = .1110  = $1.00\bar{1}0$    => 1 add
    3-bit = .111   = $1.00\bar{1}$     => 1 add
    2-bit = .11    = 0.11              => 1 add
    1-bit = .1     = 0.1               => 0 adds
    0-bit = 0      = 0                 => 0 adds

Recall: nearly all constants up to 19 bits can be reduced to at most 5 adds
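
The cost profile in this table can be recomputed with a small sketch: truncate the constant to each bitwidth and count the nonzero signed digits of the result. Using the non-adjacent form as the SD recoding is our assumption, and the function names are illustrative.

```python
def naf_weight(n):
    """Number of nonzero digits in the non-adjacent form of n >= 0."""
    weight = 0
    while n > 0:
        if n & 1:
            n -= 2 - (n % 4)   # subtract the digit (+1 or -1) chosen at this position
            weight += 1
        n //= 2
    return weight

def truncation_costs(mantissa, width):
    """Map each bitwidth to the adds needed for the truncated constant."""
    costs = {}
    for bits in range(width, -1, -1):
        truncated = mantissa >> (width - bits)   # keep the `bits` leading fractional bits
        costs[bits] = max(naf_weight(truncated) - 1, 0)
    return costs

costs = truncation_costs(0b11101, 5)   # the constant 0.11101
for bits, adds in costs.items():
    print(f"{bits}-bit: {adds} adds")
```

Several bitwidths share the same cost, which is why the effective search space is smaller than $k^n$ suggests.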
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Search  Methods 


♦  Global  Bitwidth 

-  all constants are assigned the same bitwidth

-  very fast (small search space), but only works well in some cases

♦  Greedy Search

-  starting with the maximum bitwidth, in each round choose the one constant
   whose reduction by 1 bit minimizes the quality loss

   (can also go bottom-up instead of top-down)

-  local  minima  traps  are  possible 

♦  Evolutionary  Search 

-  start  with  a  population  of  random  configurations 

-  in  each  round 

1 .  breed  a  new  generation  by  crossbreeding  and  mutations 

2.  select  from  generation  the  fittest  members 

3.  repeat  new  round 

-  local  minima  traps 
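
The top-down greedy search described above can be sketched on a toy problem. Everything concrete here is an assumption made for illustration: the "transform" is a single rotation $R_{3\pi/8}$ expanded into three lifting steps, quality is the spectral-norm error against the exact rotation (smaller is better), the cost model counts nonzero signed digits, and all names are ours rather than SPIRAL's.

```python
import math
import numpy as np

ALPHA = 3 * math.pi / 8
P, U = math.tan(ALPHA / 2), -math.sin(ALPHA)
CONSTANTS = [P, U, P]                      # constants of the three lifting steps
EXACT = np.array([[math.cos(ALPHA), math.sin(ALPHA)],
                  [-math.sin(ALPHA), math.cos(ALPHA)]])

def naf_weight(n):
    """Number of nonzero digits in the non-adjacent form of n >= 0."""
    weight = 0
    while n > 0:
        if n & 1:
            n -= 2 - (n % 4)
            weight += 1
        n //= 2
    return weight

def approximation(bits):
    """Transform built from constants rounded to the given fractional bitwidths."""
    q = [round(c * 2**b) / 2**b for c, b in zip(CONSTANTS, bits)]
    up = lambda x: np.array([[1.0, x], [0.0, 1.0]])
    lo = lambda y: np.array([[1.0, 0.0], [y, 1.0]])
    return up(q[0]) @ lo(q[1]) @ up(q[2])

def error(bits):
    return np.linalg.norm(approximation(bits) - EXACT, 2)

def cost(bits):
    """Adds spent on the constant multiplications (signed-digit cost model)."""
    return sum(max(naf_weight(abs(round(c * 2**b))) - 1, 0)
               for c, b in zip(CONSTANTS, bits))

def greedy_top_down(max_bits, max_error):
    bits = [max_bits] * len(CONSTANTS)
    while True:
        candidates = []
        for i in range(len(bits)):
            if bits[i] == 0:
                continue
            trial = bits.copy()
            trial[i] -= 1                  # reduce one constant by 1 bit
            e = error(trial)
            if e <= max_error:             # keep only configurations that satisfy Q
                candidates.append((e, trial))
        if not candidates:
            return bits                    # no single-bit reduction is feasible
        bits = min(candidates, key=lambda t: t[0])[1]   # least quality loss

best = greedy_top_down(max_bits=16, max_error=1e-2)
print(best, cost(best), error(best))
```

The loop stops at the first configuration from which no single-bit reduction stays within the threshold, which is exactly the local-minimum behavior noted on the slide.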
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Interaction  between 

Transforms,  Q  and  Search 


♦  Goal:  given  a  transform  and  a  required  Q  threshold,  find  an 
approximation  to  the  transform  that  requires  the  fewest  additions 

♦  Transforms  and  Q  tested 


    Transform        Quality threshold
    8-pt. DCT-II     8.82 dB coding gain (cg)
    16-pt. DFT       Convolution error = 1
    32-pt. DCT-II    Limited Compliance (LC) MP3 decoder*
    18x36 IMDCT      LC MP3 decoder*

♦  3 search methods were compared

♦  entire framework implemented as part of SPIRAL (www.spiral.net)

*MAD Decoder by Robert Mars, http://www.underbit.com/products/mad
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Example:  Evolutionary  Search 


Evolutionary  Search  DCT  of  size  8  with  12  constants 

•  Q = cg > 8.82, exact DCT has 8.8259

•  constant  bit  length  in  [0..31] 


Choosing  31  bits  for  all  constants:  126  additions 
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Summary  of  Search  Comparison 


Number of Additions (fewer is better)

                          8-pt. DCT-II    16-pt. DFT         32-pt. DCT-II    18x36 IMDCT
                          (8.82 dB cg)    (conv. err = 1)    (LC MP3)         (LC MP3)
    initial (31 bits)     126             500                1222             643
    global                40              168                408              182
    evol.                 36              185                490              212
    greedy (top-down)     56              158                417              170
    greedy (bottom-up)    57              154                n/a              n/a

One  search  method  alone  is  not  sufficient  —  each 
search  performs  differently  depending  on  transform 
and  quality  measure 

Approximation of DCT within JPEG


♦  Approximate the DCT-II inside JPEG while retaining images
   of reasonable quality

-  Q = peak signal-to-noise ratio (PSNR, in decibels) of the decompressed
   JPEG image against the original uncompressed input image:

$$
\mathrm{PSNR} = 20 \log_{10}\left(\frac{255}{\mathrm{RMSE}}\right), \qquad
\mathrm{RMSE} = \sqrt{\frac{1}{512 \times 512} \sum_{i=1}^{512} \sum_{j=1}^{512} \left(a_{ij} - b_{ij}\right)^2}
$$

   where $a_{ij}$ and $b_{ij}$ denote the pixel values of the original and the decompressed image

-  Q threshold

•  Test image: Lena, 512x512 pixels, 8-bit grayscale

•  PSNR must be at least 30 decibels (otherwise the
   image becomes noticeably lossy).
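
The quality measure is straightforward to compute. The sketch below assumes NumPy and 8-bit grayscale images stored as arrays; the function name `psnr` and the synthetic test images are illustrative and stand in for the Lena image, which is not reproduced here.

```python
import math
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized grayscale images."""
    diff = original.astype(np.float64) - decoded.astype(np.float64)
    rmse = math.sqrt(np.mean(diff ** 2))
    if rmse == 0.0:
        return math.inf            # identical images
    return 20.0 * math.log10(peak / rmse)

# Synthetic 512x512 example: every pixel is off by exactly one gray level.
original = np.zeros((512, 512), dtype=np.uint8)
decoded = np.ones((512, 512), dtype=np.uint8)
print(psnr(original, decoded))
```

Converting to floating point before subtracting matters: subtracting `uint8` arrays directly would wrap around for negative differences.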

Approximation of DCT within JPEG


♦  Before approximation, the original DCT* requires 261 additions and
   produces a Lena image with a PSNR of 37.6462 dB.

    Method          # Additions    PSNR (dB)
    global          37             30.0354
    evolutionary    67             36.5323
    greedy (t-d)    28             32.4503

♦  Compare constants, global vs. greedy search:

-  Global: [3/2, 3/2, 3/2, 3/2, 3/2, 3/2, 3/2, 1/2, -1/2, 1,
   -1/2, -1/2, 1/2, -1/2, -1, 1, -1, -1/4, 1/2, -1/4]

-  Greedy: [3/2, 1, 1, 1, 1, 1, 1, 1/2, -1/2, 1, -1/2,
   0, 1/2, 0, -1, 1, -1, 0, 1/2, -1/4]

-  Greedy succeeds in zeroing 3 constants that affect the high-frequency
   (HF) outputs 'thrown away' by JPEG

*Based on source from the Independent JPEG Group (IJG), http://www.ijg.org

Summary 


♦  Application  specific  tuning  yields  ample  opportunities 
for  optimization 

♦  The  optimization  flow  can  be  automated 

-  algorithm  selection  and  manipulation 

-  arithmetic  reduction  through  search 

-  arbitrary  quality  measures  supported 

♦  Details of the arithmetic reduction are non-trivial

-  non-monotonic relation between Q and C

-  different search methods succeed in different scenarios

♦  The results of this study need to be combined with
   other aspects of DSP domain-specific high-level
   synthesis
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